@casteryh casteryh commented Nov 1, 2025

Summary:
This diff adds a FIFO waiting mechanism for acquiring the queue pair.

True concurrency support will come in a follow-up diff. It requires tracking wr_id and creating a polling task for each queue pair, which needs further refactoring.

Limitations

  • No true concurrency support: only one request can be in-flight at a time per queue pair connection. However, subsequent requests now wait fairly (FIFO) instead of panicking with "already checked out" errors.
  • request_queue_pair/release_queue_pair are still not cancel safe: if an operation times out, the queue pair might not be returned, and subsequent requests will fail.
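
To make the cancel-safety limitation concrete, here is a minimal std-only sketch of one common way to guarantee a release: an RAII guard whose `Drop` returns the resource. This is not what this diff implements. `QpSlot`, `QpGuard`, and `operation_that_bails` are hypothetical names used only for illustration, and a plain `bool` stands in for the real queue pair state.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for one queue pair slot: `true` while checked out.
pub struct QpSlot {
    busy: Mutex<bool>,
}

/// Hypothetical guard: holding it means the slot is checked out.
pub struct QpGuard {
    slot: Arc<QpSlot>,
}

impl QpSlot {
    pub fn new() -> Arc<QpSlot> {
        Arc::new(QpSlot { busy: Mutex::new(false) })
    }

    /// Check the slot out, or return None if it is already in use.
    pub fn try_checkout(slot: &Arc<QpSlot>) -> Option<QpGuard> {
        let mut busy = slot.busy.lock().unwrap();
        if *busy {
            return None;
        }
        *busy = true;
        Some(QpGuard { slot: Arc::clone(slot) })
    }

    pub fn is_busy(&self) -> bool {
        *self.busy.lock().unwrap()
    }
}

impl Drop for QpGuard {
    /// Runs on every exit path, including early returns and errors.
    fn drop(&mut self) {
        *self.slot.busy.lock().unwrap() = false;
    }
}

/// Simulates an operation that fails (for example, a timeout) after checkout.
pub fn operation_that_bails(slot: &Arc<QpSlot>) -> Result<(), String> {
    let _guard = QpSlot::try_checkout(slot)
        .ok_or_else(|| "queue pair busy".to_string())?;
    // No explicit release call here: dropping `_guard` frees the slot.
    Err("timed out".to_string())
}
```

With a separate request/release pair, the early `Err` return above would leave the slot marked busy. With the guard, the slot is free again as soon as the function returns.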

Core changes:

  1. Fair Waiting via Semaphore: Replaced Available/CheckedOut states with Connecting/Ready/ConnectionError. Added QueuePairEntry wrapping RdmaQueuePair + Arc<Semaphore>. request_queue_pair uses a two-phase approach (get/create the QP, then acquire a semaphore permit for FIFO fairness).

  2. Refactor - Moved read_into/write_from to RdmaManagerActor: Prevents a deadlock in the actor message queue. The old design had RdmaBuffer call the request_queue_pair RPC → perform the operation → call the release_queue_pair RPC, so release messages queued behind waiting requests. Now the entire operation (request → use → release) happens within a single actor message handler.

  3. Refactored Connection Logic: Extracted establish_connection helper handling both loopback and remote connections.

  4. Test: Added create_buffer_pair method and concurrent tests.
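
For readers unfamiliar with FIFO waiting, the fairness idea in change 1 can be sketched with std primitives only. This is an illustration under stated assumptions, not the diff's code: the diff uses an Arc<Semaphore>, while this sketch swaps in a ticket counter built from `Mutex` and `Condvar`. `FifoGate` and `run_in_ticket_order` are hypothetical names, and pushing to a log stands in for the RDMA operation.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// `next` is the next ticket to hand out; `serving` is the ticket allowed in.
struct Tickets {
    next: u64,
    serving: u64,
}

/// Hypothetical single-permit gate that admits waiters in ticket order.
pub struct FifoGate {
    tickets: Mutex<Tickets>,
    turn: Condvar,
}

impl FifoGate {
    pub fn new() -> Self {
        FifoGate {
            tickets: Mutex::new(Tickets { next: 0, serving: 0 }),
            turn: Condvar::new(),
        }
    }

    /// Join the queue; the returned ticket fixes this caller's position.
    pub fn take_ticket(&self) -> u64 {
        let mut t = self.tickets.lock().unwrap();
        let ticket = t.next;
        t.next += 1;
        ticket
    }

    /// Block until `ticket` is the one being served.
    pub fn wait_turn(&self, ticket: u64) {
        let mut t = self.tickets.lock().unwrap();
        while t.serving != ticket {
            t = self.turn.wait(t).unwrap();
        }
    }

    /// Hand the permit to the next ticket in line.
    pub fn release(&self) {
        let mut t = self.tickets.lock().unwrap();
        t.serving += 1;
        self.turn.notify_all();
    }
}

/// Takes `n` tickets in order, starts the workers in reverse order, and
/// returns the order in which the workers were admitted.
pub fn run_in_ticket_order(n: u64) -> Vec<u64> {
    let gate = Arc::new(FifoGate::new());
    let log = Arc::new(Mutex::new(Vec::new()));
    let tickets: Vec<u64> = (0..n).map(|_| gate.take_ticket()).collect();
    let handles: Vec<_> = tickets
        .into_iter()
        .rev()
        .map(|ticket| {
            let gate = Arc::clone(&gate);
            let log = Arc::clone(&log);
            thread::spawn(move || {
                gate.wait_turn(ticket);
                log.lock().unwrap().push(ticket); // stand-in for the RDMA op
                gate.release();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let order = log.lock().unwrap().clone();
    order
}
```

Even though the workers start in reverse order, admission follows the order in which tickets were taken, which is the property the description refers to as FIFO fairness. Only one worker is inside the gate at a time, matching the one-in-flight limitation above.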

Differential Revision: D85627877

Summary:
Allows us to test the bandwidth of concurrent RDMA operations.

Actual concurrency support will be added later in the stack.

Differential Revision: D85724514

Summary:
This diff adds a FIFO waiting mechanism for acquiring the queue pair.

**True concurrency support will come in a follow-up diff**: it requires tracking wr_id and creating a polling task for each queue pair, which needs further refactoring.

**Limitations**
- No true concurrency support: only one request can be in-flight at a time per queue pair connection. However, subsequent requests now wait fairly (FIFO) instead of panicking with "already checked out" errors.
- request_queue_pair/release_queue_pair are still not cancel safe: if an operation times out, the queue pair might not be returned, and subsequent requests will fail.

Core changes:
1. **Fair Waiting via Semaphore**: Replaced Available/CheckedOut states with Connecting/Ready/ConnectionError. Added QueuePairEntry wrapping RdmaQueuePair + Arc<Semaphore>. request_queue_pair uses a two-phase approach (get/create the QP, then acquire a semaphore permit for FIFO fairness).

2. **Refactor - Moved read_into/write_from to RdmaManagerActor**: Prevents a deadlock in the actor message queue. The old design had RdmaBuffer call the request_queue_pair RPC → perform the operation → call the release_queue_pair RPC, so release messages queued behind waiting requests. Now the entire operation (request → use → release) happens within a single actor message handler.

3. **Refactored Connection Logic**: Extracted establish_connection helper handling both loopback and remote connections.

4. **Test**: Added create_buffer_pair method and concurrent tests.

Differential Revision: D85627877
@meta-cla meta-cla bot added the CLA Signed label Nov 1, 2025

@meta-codesync meta-codesync bot commented Nov 1, 2025

@casteryh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85627877.

Labels

CLA Signed, fb-exported, meta-exported